12 Generalized Additive Models

Chelsea Parlett-Pelleriti

Non-Linear Linear Regression

Generalized Linear Models Review

\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ y | \mathbf{x} \sim \pi\left (\mu,...\right ) \]

  • link function: \(g()\)

  • likelihood function: \(\pi()\)
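As a quick concrete reminder, both ingredients show up in base R's glm(). A minimal sketch on simulated data (the Poisson data and coefficients below are made up for illustration):

```r
# A GLM with log link g() and Poisson likelihood pi(), on simulated data
set.seed(1)
x <- runif(100)
y <- rpois(100, lambda = exp(1 + 2 * x))  # true model: log(mu) = 1 + 2x

fit <- glm(y ~ x, family = poisson(link = "log"))

eta <- predict(fit, type = "link")      # the linear predictor, X beta
mu  <- predict(fit, type = "response")  # mu = g^{-1}(X beta) = exp(eta)
all.equal(mu, exp(eta))                 # TRUE
```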

Generalized Linear Models Review

Generalized Linear Models Review

\[ \mathbf{X}\beta \]

“Linear in the parameters”: the linear predictor is just a linear combination of the columns of the design matrix, \(\mathbf{X}\)

Generalized Linear Models Review

\[ \mathbf{X}\beta \]

“Linear in the parameters”: the linear predictor is just a linear combination of the columns of the design matrix, \(\mathbf{X}\)

❗But no one said the columns had to be “linear”

Creating Non-Linearity

  • Link Functions (Generalized Linear Models)

  • Feature Engineering (Additive Models)

Feature Engineering

Feature Engineering: Polynomials

Design Matrix: a matrix containing all the inputs to your model after being processed (e.g. one hot encoding, polynomial basis expansion)

\[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 0 \\ 1 & 1 \\ \end{bmatrix} \to \begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 2 & 4 & 8\\ 1 & 3 & 9 & 27\\ 1 & 0 & 0 & 0\\ 1 & 1 & 1 & 1\\ \end{bmatrix} \]
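The expansion above can be reproduced in base R; here outer() raises each value of x to the powers 0 through 3:

```r
# Cubic polynomial basis expansion: columns are x^0, x^1, x^2, x^3
x <- c(1, 2, 3, 0, 1)
X <- outer(x, 0:3, `^`)
X
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    1
# [2,]    1    2    4    8
# [3,]    1    3    9   27
# [4,]    1    0    0    0
# [5,]    1    1    1    1
```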

Feature Engineering: Polynomials

We don’t want our polynomials to be too wiggly

Feature Engineering: Piecewise

Additive Models

\[ y_i = \beta_0 + f(x_i) + \epsilon \]


Here, \(y\) is modeled as an intercept plus a smooth function \(f\) of our predictor \(x\).

Note: This is still “linear in the parameters”; we’re just transforming our predictor \(x\) in non-linear ways

Additive Models

Consider our Piecewise model:

\[ y_i = \beta_0 + f(x_i) + \epsilon \]

where \(f(x) = \sum_{j=1}^q b_j(x) * \beta_j\). In this case, \(b_j(x)\) are functions that create dummy variables indicating which range \(x\) falls in, and \(\beta_j\) are the means for each range.
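A base-R sketch of those dummy-variable basis functions, using cut() (the knot locations 1, 2, 3 are made up for illustration):

```r
# Piecewise-constant basis: one dummy column per range of x
x <- c(0.5, 1.5, 2.5, 3.5)
ranges <- cut(x, breaks = c(-Inf, 1, 2, 3, Inf))

# model.matrix() expands the factor into dummy columns; each row has a
# single 1 marking the range its x falls in
B <- model.matrix(~ ranges - 1)
rowSums(B)   # each row sums to 1
```

Regressing y on B (without an intercept) would then estimate one coefficient per range, i.e. the mean of y within that range.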

Additive Models

Or, consider our polynomial regression:

\[ y_i = \beta_0 + f(x_i) + \epsilon \]

where \(f(x) = \sum_{j=1}^q b_j(x) * \beta_j\). In this case, \(b_j(x)\) are functions that take \(x\) to the \(j^{th}\) power, and \(\beta_j\) are the corresponding coefficients for each polynomial term.

Additive Models

For \(f(x) = \sum_{j=1}^q b_j(x) * \beta_j\), the \(b_j(x)\) are called basis functions. These are transformations of our original predictor \(x\), and each basis function is added as a predictor to our design matrix.

\[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 0 \\ 1 & 1 \\ \end{bmatrix} \to \begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 2 & 4 & 8\\ 1 & 3 & 9 & 27\\ 1 & 0 & 0 & 0\\ 1 & 1 & 1 & 1\\ \end{bmatrix} \]

Feature Engineering: Piecewise Polynomials

Knots

When fitting Additive Models and choosing \(f(x)\) to transform our predictor \(x\), we have to consider a few things. For example, for our Piecewise functions, we needed to choose knots.

Knots are points at which the function changes

Currently, the functions are not required to be continuous at the knots. But we could require that. We could also require that the first and second derivatives be equal at the knots so that the function is smooth.

Continuous Functions

\[ \text{Income} = 35 + 1.5 * \text{years of ed} + 2.5 * (\text{years of ed} - 13)_+ + \epsilon \]

💡 We can choose to write our features in a way that enforces continuity
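One common way to do this is a hinge (truncated line) feature, max(years of ed − 13, 0), which is zero up to the knot so the two segments meet there. A sketch with the slide's coefficients:

```r
# Hinge feature: pmax(years - 13, 0) is 0 before the knot at 13, so the
# function is continuous there; the slope changes from 1.5 to 1.5 + 2.5 = 4
income <- function(years) 35 + 1.5 * years + 2.5 * pmax(years - 13, 0)

income(13) - income(12)   # slope below the knot: 1.5
income(14) - income(13)   # slope above the knot: 4
```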

Continuous Functions

💡 We can choose to write our features in a way that enforces smoothness

Regression Splines

Regression Splines: continuous, smooth, piecewise polynomial functions of a predictor \(x\) that are used to predict \(y\)

Regression Splines

Regression splines allow us to get flexible models without needing a super high degree polynomial.

Regression Splines

❓ How do we create these regression splines that are continuous at each knot, and smooth?



Choosing the right Basis Functions.

Basis Functions

Basis Functions \(b_i(x)\) allow us to create smooth functions by multiplying each basis function by a coefficient and adding them together

\[ f(x) = \sum_{j=1}^q b_j(x) * \beta_j \]

But remember: we don’t want our functions to be too wiggly.
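For example, R's built-in splines package provides a B-spline basis; each column of the matrix below is one smooth basis function, and a candidate f(x) is just a weighted sum of those columns (the coefficients here are arbitrary):

```r
library(splines)  # ships with base R

x <- seq(0, 1, length.out = 200)
B <- bs(x, df = 6)   # cubic B-spline basis with 6 basis functions

# f(x) = sum_j b_j(x) * beta_j: one coefficient per basis column
beta <- c(0, 2, -1, 3, 0.5, 1)
f <- B %*% beta      # a smooth function evaluated at each x
dim(B)               # 200 x 6
```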

Wiggliness

Wiggliness: how quickly a function changes across its range of values

\[ \text{wiggle} = \int \left[ f''(x) \right]^2 dx \]

Second derivative: the rate of change of the rate of change (derivative of a derivative)

💡 In a wiggly function, what would you expect the second derivative to be like?

Wiggliness

  • first derivative is changing quickly
  • second derivative is large (in magnitude)
  • second derivative squared is large
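The wiggliness integral can be approximated numerically with finite differences; a base-R sketch comparing a gentle function to a wiggly one:

```r
# Approximate the integral of [f''(x)]^2 on a grid
wiggle <- function(f, x) {
  h  <- x[2] - x[1]
  d2 <- diff(f(x), differences = 2) / h^2   # finite-difference f''
  sum(d2^2) * h                             # Riemann sum of [f'']^2
}

x <- seq(0, 1, length.out = 1000)
wiggle(function(x) x^2,         x)   # gentle: f'' = 2, integral is about 4
wiggle(function(x) sin(20 * x), x)   # wiggly: f'' oscillates, much larger
```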

Wiggliness

We want our function to be just wiggly enough to capture the true relationship in the data, but not so wiggly that it overfits

Wiggliness

(a) High Variance
(b) Low Variance
Figure 1

Wiggliness

(a) High Bias
(b) Low Bias
Figure 2

Wiggliness

One of the ways we can control the amount of wiggliness in our function is by choosing the number of basis functions.

from: https://noamross.github.io/gams-in-r-course/chapter1

Regularization

\[ \text{Penalized Log-Likelihood} = \text{Log-Likelihood} - \lambda*\text{Wiggle} \]

We can also penalize the model for being too wiggly.
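A penalized fit can be sketched directly with penalized least squares, using a discrete second-difference penalty on the basis coefficients as a stand-in for the wiggliness integral (this P-spline-style setup, the basis size, and the λ values are all illustrative, not mgcv's exact machinery):

```r
library(splines)
set.seed(2)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

B <- bs(x, df = 20)                       # rich basis: would overfit unpenalized
D <- diff(diag(ncol(B)), differences = 2) # second differences of coefficients
S <- t(D) %*% D                           # discrete wiggliness penalty matrix

# Minimize ||y - B beta||^2 + lambda * beta' S beta
fit_pen <- function(lambda) {
  beta <- solve(t(B) %*% B + lambda * S, t(B) %*% y)
  B %*% beta
}
f_smooth <- fit_pen(10)     # heavy penalty: smooth fit
f_wiggly <- fit_pen(1e-6)   # almost no penalty: chases the noise
```

The lightly penalized fit hugs the data more closely (smaller residual sum of squares), while the heavy penalty trades some fit for smoothness.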

Regularization

We want to choose \(\lambda\) in a way that allows us to fit the appropriate function (low bias) but does not overfit (low variance).

from: https://noamross.github.io/gams-in-r-course/chapter1

Wiggliness

We have two main levers to control wiggliness:

  • \(\lambda\): regularization strength

  • \(k\): number of basis functions

Additive Models

We’re not limited to one predictor:

\[ y_i = \beta_0 + f_1(x_{1i}) + f_2(x_{2i}) + ... + f_n(x_{ni}) + \epsilon \]


We can use smooth functions of multiple predictor variables to create an additive model.

Generalized Additive Models

Generalized Additive Models just apply the same generalization as Generalized Linear Models:

\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ y | \mathbf{x} \sim \pi\left (\mu,...\right ) \]

  • link function: \(g()\)

  • likelihood function: \(\pi()\)


But now we know \(\mathbf{X}\beta\) has the form \(\beta_0 + \sum_{j=1}^q f_j(x_{j})\)

GAMs in R

We’re going to use the mgcv package in R to fit GAMs

library(tidyverse)
library(mgcv)
library(gamair)
data(mpg)

gam1 <- gam(city.mpg ~ s(hp),
            data = mpg)

plot(gam1)

GAMs in R

gam2 <- gam(city.mpg ~ s(hp) + s(weight),
            data = mpg)

par(mfrow = c(1, 2))
plot(gam2, all.terms = TRUE)

GAMs in R

gam3 <- gam(city.mpg ~ fuel + s(hp) + s(weight),
            data = mpg)

par(mfrow = c(2, 2))
plot(gam3, pages = 1, all.terms = TRUE)

GAMs in R

gam4 <- gam(city.mpg ~ s(hp, by = fuel) + s(weight),
            data = mpg)

par(mfrow = c(2, 2))
plot(gam4, pages = 1, all.terms = TRUE)

GAMs in R

gam5 <- gam(city.mpg ~ fuel + s(hp, k = 20) + s(weight),
            data = mpg)

par(mfrow = c(2, 2))
plot(gam5, pages = 1, all.terms = TRUE)

GAMs in R

gam6 <- gam(city.mpg ~ fuel + s(hp, sp = 1000) + s(weight),
            data = mpg)

par(mfrow = c(2, 2))
plot(gam6, pages = 1, all.terms = TRUE)

GAMs in R

summary(gam6)

Family: gaussian 
Link function: identity 

Formula:
city.mpg ~ fuel + s(hp, sp = 1000) + s(weight)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  32.2914     0.6502   49.66   <2e-16 ***
fuelgas      -7.8205     0.6968  -11.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
            edf Ref.df     F p-value    
s(hp)     1.004  1.007 31.78  <2e-16 ***
s(weight) 5.791  7.008 44.92  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.868   Deviance explained = 87.3%
GCV =  5.975  Scale est. = 5.7162    n = 203

GAMs in R

In practice, we want to fit our models with method = 'REML'. REML is good for selecting the optimal level of smoothness and can handle a wide variety of effects (e.g. mixed effects) with good numerical stability.

gam7 <- gam(city.mpg ~ fuel + s(hp, k = 10) + s(weight, k = 10),
            data = mpg,
            method = 'REML')

par(mfrow = c(2, 2))
plot(gam7, pages = 1, all.terms = TRUE)

GAM Reporting

  • Report edf and significance

  • Show Partial Effect Plots

  • Use marginaleffects (see here)

GAMs in R

We can also fit fully Generalized Additive Models by choosing a non-Gaussian likelihood, e.g. a binomial (logistic) GAM for the binary outcome fuel.

gam9 <- gam(fuel ~ s(hp) + s(weight),
            family = binomial(),
            data = mpg,
            method = 'REML')

par(mfrow = c(2, 2))
plot(gam9, pages = 1, all.terms = TRUE)